Clustering orthologous proteins across phylogenetically distant species.
Abstract
The quality of orthologous protein clusters (OPCs) depends largely on the results of reciprocal BLAST (Basic Local Alignment Search Tool) hits among genomes. The BLAST algorithm is fast and efficient, but it is very difficult to obtain an optimal solution among phylogenetically distant species, because genomes separated by a large evolutionary distance typically have low similarity in their protein sequences. To reduce false positives in the OPCs, a threshold is often applied to the BLAST scores. However, thresholding also eliminates large numbers of true positives, since orthologs from distant species are likely to have low BLAST scores. To address this problem, we introduce a new hybrid method that combines the Recursive and Markov Cluster (MCL) algorithms without BLAST thresholding. In the first step, we use InParanoid to produce n(n-1)/2 ortholog tables from n genomes. After combining all the tables into one, our clustering algorithm recursively clusters the ortholog pairs in the table. Our method then applies the MCL algorithm to compute the clusters and refines them by adjusting the inflation factor. We tested our method on six different genomes and evaluated the results against KEGG Orthology (KO) OPCs, which are generated from manually curated pathways. To quantify the accuracy of the results, we introduced a new, intuitive similarity measure, based on our Least-move algorithm, that computes the consistency between two sets of OPCs. We compared the resulting OPCs with the KO OPCs using this measure, and we also evaluated the performance of our method using InParanoid as the baseline. The experimental results show that, at an inflation factor of 1.3, our method produced 54% more orthologs than InParanoid at slightly lower accuracy (1.7% lower), and at an inflation factor of 1.4, it produced 15% more orthologs than InParanoid with higher accuracy (1.4% higher).
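The first two steps described in the abstract (building n(n-1)/2 pairwise tables, then grouping the combined ortholog pairs) can be illustrated with a minimal sketch. The abstract does not specify how the recursive clustering works, so the code below swaps in a plain transitive closure over ortholog pairs using union-find; it is not the authors' algorithm, and it does not include the MCL refinement step. The function names, genome labels, and protein identifiers are all hypothetical.

```python
from itertools import combinations


def pair_table_count(n_genomes):
    """Number of pairwise ortholog tables for n genomes: n(n-1)/2."""
    return n_genomes * (n_genomes - 1) // 2


def merge_ortholog_pairs(pairs):
    """Group proteins linked by any chain of ortholog pairs.

    Assumption: a union-find transitive closure stands in for the
    recursive clustering step, whose details the abstract omits.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    clusters = {}
    for protein in parent:
        clusters.setdefault(find(protein), set()).add(protein)
    return sorted(sorted(members) for members in clusters.values())


# Hypothetical genome labels: six genomes give 15 pairwise tables.
genomes = ["g1", "g2", "g3", "g4", "g5", "g6"]
tables = list(combinations(genomes, 2))

# Hypothetical ortholog pairs, as if read from the combined table.
pairs = [
    ("g1|pA", "g2|pA"),
    ("g2|pA", "g3|pA"),
    ("g1|pB", "g4|pB"),
]
clusters = merge_ortholog_pairs(pairs)
```

With six genomes, as in the evaluation described above, `pair_table_count(6)` gives 15 tables. A pure transitive closure like this one tends to over-merge clusters through a single spurious pair, which is one plausible reason a refinement step such as MCL with a tunable inflation factor would follow.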
Similar resources
Conservation versus parallel gains in intron evolution
Orthologous genes from distant eukaryotic species, e.g. animals and plants, share up to 25-30% intron positions. However, the relative contributions of evolutionary conservation and parallel gain of new introns into this pattern remain unknown. Here, the extent of independent insertion of introns in the same sites (parallel gain) in orthologous genes from phylogenetically distant eukaryotes is ...
Computer Analysis of the GCN4 Regulon of Yeast Saccharomyces cerevisiae
Binding sites of eukaryotic transcriptional regulators are often very short, and the specificity of recognition is attained by cooperative binding of regulators to clusters of sites. We analyzed clustering of binding sites of the global regulator of amino acid metabolism Gcn4p in regulatory regions of nine genes of yeast Saccharomyces cerevisiae and of orthologous genes of a phylogenetically qu...
Newly Identified Motifs within PAS Domains of Filamentous Cyanobacteria
Many cyanobacterial genome projects have been finished or are ongoing. Now, complete genome sequence of Anabaena sp. PCC 7120 (Anabaena)[1] and draft genome sequence of Nostoc punctiforme (Nostoc)[3] are available. Phylogenetically, these species are closely related. These species have extremely abundant signal-transduction proteins, especially PAS domain-containing proteins (PAS-containing prot...
OrthoMCL-DB: querying a comprehensive multi-species collection of ortholog groups
The OrthoMCL database (http://orthomcl.cbil.upenn.edu) houses ortholog group predictions for 55 species, including 16 bacterial and 4 archaeal genomes representing phylogenetically diverse lineages, and most currently available complete eukaryotic genomes: 24 unikonts (12 animals, 9 fungi, microsporidium, Dictyostelium, Entamoeba), 4 plants/algae and 7 apicomplexan parasites. OrthoMCL software ...
Unique Evolution of Symbiobacterium thermophilum Suggested from Gene Content and Orthologous Protein Sequence Comparisons
Comparisons of gene content and orthologous protein sequence constitute a major strategy in whole-genome comparison studies. It is expected that horizontal gene transfer between phylogenetically distant organisms and lineage-specific gene loss have greater influence on gene content-based phylogenetic analysis than orthologous protein sequence-based phylogenetic analysis. To determine the evolut...
Journal:
- Proteins
Volume 71, Issue 3
Pages: -
Publication date: 2008